perf: Replace HashSet with generational vec by michael-weigelt · Pull Request #261 · bytecodealliance/regalloc2

michael-weigelt · 2026-06-05T17:51:39Z

When compiling certain types of functions (e.g. with many locals), a lot of time is spent in try_to_allocate_bundle_to_reg, in particular on HashSet::insert. This data structure is only used for deduplicating and the cost can be quite high (hashing, allocations, bad cache properties).

Replacing it with a generational vector gives significant improvements for my ad-hoc benchmark (deliberately compiled with no optimizations, because my use-case currently demands it):
Compiling a Wasm function with X locals before and after this patch:

1000 locals: 780ms compilation -> 363ms  -53%
10k locals:  66s compilation   -> 23s    -65%
40k locals:  952s compilation  -> 356s   -62%

The Sightglass benchmarks are mixed, see the comment below.
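
For readers unfamiliar with the term, a generational (generation-stamped) vector used as a deduplication set can be sketched roughly as below. This is an illustration only, not the code in this PR: the type name `GenerationalSet`, its fields, and the `u32` stamp width are all assumptions.

```rust
/// A reusable "seen" set over dense indices.
/// Hypothetical names; not the PR's actual code.
struct GenerationalSet {
    /// stamps[i] == generation means index i is in the current set.
    stamps: Vec<u32>,
    generation: u32,
}

impl GenerationalSet {
    fn new() -> Self {
        // Generation starts at 1 so zero-initialized stamps read as "absent".
        Self { stamps: Vec::new(), generation: 1 }
    }

    /// Empties the set in O(1) by moving to a fresh generation.
    fn clear(&mut self) {
        self.generation = self.generation.wrapping_add(1);
        if self.generation == 0 {
            // Counter wrapped: stale stamps could alias, so reset them all.
            self.stamps.iter_mut().for_each(|s| *s = 0);
            self.generation = 1;
        }
    }

    /// Inserts `index`; returns true if it was not already present.
    fn insert(&mut self, index: usize) -> bool {
        if index >= self.stamps.len() {
            self.stamps.resize(index + 1, 0);
        }
        let fresh = self.stamps[index] != self.generation;
        self.stamps[index] = self.generation;
        fresh
    }
}

fn main() {
    let mut seen = GenerationalSet::new();
    let mut conflicts = Vec::new();
    // Made-up bundle indices with duplicates.
    for bundle in [3usize, 7, 3, 1, 7] {
        if seen.insert(bundle) {
            conflicts.push(bundle);
        }
    }
    println!("{:?}", conflicts);
    seen.clear();
    println!("{}", seen.insert(3));
}
```

The point of the design is that membership tests and inserts are plain indexed loads and stores with no hashing, and clearing between uses does not touch the vector except on the rare counter wrap.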

michael-weigelt · 2026-06-05T17:55:24Z

@cfallin I was told you might want to have a look at this.

michael-weigelt · 2026-06-05T18:12:50Z

This is a small and (maybe too) focused improvement in the direction that is also discussed here: #87

cfallin · 2026-06-05T18:58:07Z

@michael-weigelt thanks for this. It's an interesting optimization. Could you build a Wasmtime engine (wasmtime-bench-api target) with this version of regalloc2 and with baseline, and test both with Sightglass over the full suite?

Those speedups are frankly fantastic if real but I'm also more than a little skeptical about purported 2-3x speedups; I've done a lot of profiling of regalloc2 and, at least last time I did this (a few years ago now), didn't see major hotspots outside of the expected ones (e.g. the main allocation maps). That said, I welcome benchmark results that say otherwise if reliable!

Data-structure-wise, I was initially skeptical in reading the patch about the dense representation but I see that it's a reused set so we will only allocate one per function compilation -- so it's O(n) time and space cost overall, which is acceptable. It's still worse than the O(k) cost for k << n (very few conflicts), so I do want a careful evaluation over the suite.
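
The tradeoff described above can be made concrete with a small sketch that deduplicates the same queries both ways. The function names, the value of `n`, and the query contents are hypothetical and chosen only for illustration; the sketch assumes items are dense indices below `n`.

```rust
use std::collections::HashSet;

/// Sparse: memory and work scale with the k distinct items seen.
fn dedup_sparse(items: &[usize]) -> Vec<usize> {
    let mut seen = HashSet::new();
    items.iter().copied().filter(|&i| seen.insert(i)).collect()
}

/// Dense: `stamps` holds one slot per possible index (n). The caller
/// allocates it once and reuses it; each query only touches the slots
/// it visits. `generation` must be nonzero and fresh for every query.
fn dedup_dense(items: &[usize], stamps: &mut Vec<u32>, generation: u32) -> Vec<usize> {
    let mut out = Vec::new();
    for &i in items {
        if i >= stamps.len() {
            stamps.resize(i + 1, 0);
        }
        if stamps[i] != generation {
            stamps[i] = generation;
            out.push(i);
        }
    }
    out
}

fn main() {
    let n = 10_000; // hypothetical number of bundles in one function
    let mut stamps = vec![0u32; n]; // O(n) space, paid once per function
    let queries: [&[usize]; 3] = [&[5, 9_999, 5], &[9_999, 1], &[]];
    for (q, items) in queries.iter().enumerate() {
        let generation = q as u32 + 1; // fresh generation per query, never 0
        // Both strategies must agree on the deduplicated result.
        assert_eq!(dedup_sparse(items), dedup_dense(items, &mut stamps, generation));
    }
    println!("table slots: {}", stamps.len());
}
```

With few conflicts per query the sparse version allocates only for what it sees, while the dense version carries the full n-slot table for the whole compilation, which is the O(k) versus O(n) difference at issue.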

Side-note: in BA we now have the AI tools policy, and while you are welcome to use tools such as Claude privately (many of us do for various purposes), you are fully and solely responsible for interactions and discussions here, so instead of "Claude said [thing you're unsure about]", please speak only about things you've personally understood/vetted and don't relay raw agent thoughts or output. In this case that vetting includes running the benchmarks, and carefully reasoning about the algorithmic tradeoff as humans ourselves. Thanks!

cfallin · 2026-06-05T19:00:13Z

Also: I would want to see that Wasmtime+Cranelift pass all tests with the modified regalloc2, just to make sure we're not missing something wrt changed output (which could also explain huge compile-time wins if e.g. we suddenly treat conflicts differently and allocate less optimally or whatever). The algorithm looks correct but let's verify.

michael-weigelt · 2026-06-05T20:24:09Z

Thanks for the pointers, I'll run those benchmarks. I don't expect the speedups of the general cases to be this impressive, but if they are at least not worse, I am happy.

> you are fully and solely responsible for interactions and discussions here, so instead of "Claude said [thing you're unsure about]", please speak only about things you've personally understood/vetted

Understood and agreed. The whole point of that sentence was: I want to vet this. For the record, because I am a new contributor here: The only AI contributions in this PR are the brainstorming on how to address the problem and code. I would never relay slop to a human or let it talk/write prose for me.

michael-weigelt · 2026-06-09T11:48:10Z

> Wasmtime+Cranelift pass all tests with the modified regalloc2

I ran them all (cargo test in the wasmtime/cranelift directory and also at the root) and observed no failures.

I ran all.suite on Sightglass (excluding only splay.wasm because that seems to miss a gc flag somewhere), and the results are attached here. I ran it once with the default engine settings, and once with no optimizations (which is my use case). The results are very mixed. One way to summarize them:

With noopt, baseline is faster in 6 benchmarks and slower in 11 benchmarks than this PR.
With opt, baseline is faster in 14 benchmarks and slower in 17 benchmarks than this PR.

But the size of the difference varies quite a lot from benchmark to benchmark.
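
Since win/loss counts hide the magnitudes, one possible way to condense paired timings into a single figure is the geometric mean of the per-benchmark ratios. The sketch below is only an illustration of that idea: the function `summarize` is hypothetical, it is not how Sightglass reports results, and the numbers in `main` are placeholders rather than measurements from this PR.

```rust
/// Takes (baseline_time, patched_time) pairs with positive times.
/// Returns (benchmarks where baseline is faster,
///          benchmarks where the patch is faster,
///          geometric mean of patched / baseline).
fn summarize(pairs: &[(f64, f64)]) -> (usize, usize, f64) {
    let mut baseline_faster = 0;
    let mut patched_faster = 0;
    let mut log_sum = 0.0;
    for &(baseline, patched) in pairs {
        if baseline < patched {
            baseline_faster += 1;
        } else if patched < baseline {
            patched_faster += 1;
        }
        // Summing logs avoids overflow and treats 2x and 0.5x symmetrically.
        log_sum += (patched / baseline).ln();
    }
    let geomean = if pairs.is_empty() {
        1.0
    } else {
        (log_sum / pairs.len() as f64).exp()
    };
    (baseline_faster, patched_faster, geomean)
}

fn main() {
    // Placeholder timings, NOT results from this PR.
    let pairs = [(100.0, 50.0), (100.0, 200.0)];
    let (baseline_faster, patched_faster, geomean) = summarize(&pairs);
    println!(
        "baseline faster: {baseline_faster}, patch faster: {patched_faster}, geomean ratio: {geomean:.3}"
    );
}
```

A geometric mean below 1.0 would indicate that the patched build is faster on balance, above 1.0 slower; it says nothing about statistical significance of the individual differences.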

I am interested to hear your thoughts about this, @cfallin. I don't know which operating points are most important for this library; in particular, I don't know if you care about the noopt case. Furthermore, the PR is motivated by an uncommon Wasm module, so I do realize I have the double disadvantage of a special operating point and a special input. All that is to say: I'd understand if you reject this PR. But in its favour, I'd say it at least does not look clearly worse on average, and it is massively better in the very special case.

all_with_noopt.txt
all_normal.txt

cfallin · 2026-06-09T14:46:52Z

Thanks, @michael-weigelt. I think the best way forward here is to be objective with the benefit of data: if the PR causes more compilation slowdowns than speedups, it's not worth it overall (for a general-purpose compiler). I guess I'd say that this is not too surprising when the initial evaluation is a Wasm of a "special" shape -- undoubtedly one could specialize the compiler for certain corners of the parameter space and get better performance in those cases, but we unfortunately have to build a compiler that works well for most typical inputs. Thanks nevertheless for your efforts here!

michael-weigelt added 2 commits June 5, 2026 17:35

initial change

b075528

same-as-last bundle

ad389a2

michael-weigelt mentioned this pull request Jun 5, 2026

perf: Replace BTreeMap with sorted vec michael-weigelt/regalloc2#1

Draft

michael-weigelt commented Jun 5, 2026

Comment thread src/ion/data_structures.rs Outdated

u64

14c22e7

cfallin closed this Jun 9, 2026

michael-weigelt wants to merge 3 commits into bytecodealliance:main from michael-weigelt:mwe/conflict_set
